A novel batch-effect correction method for scRNA-seq data based on Adversarial Information Factorization

Single-cell RNA sequencing (scRNA-seq) technology provides unprecedented resolution at the level of a single cell, raising great hopes in medicine. Nevertheless, scRNA-seq data suffer from high variability due to the experimental conditions, called batch effects, which prevents any aggregated downstream analysis. Adversarial Information Factorization provides a robust batch-effect correction method that relies neither on prior knowledge of the cell types nor on a specific normalization strategy, while being suited to any downstream analysis task. It matches and even outperforms state-of-the-art methods in several scenarios: low signal-to-noise ratio, batch-specific cell types with few cells, and a multi-batch dataset with imbalanced batches and batch-specific cell types. Moreover, it best preserves the relative gene expression between cell types, yielding superior differential expression analysis results. Finally, in the more complex setting of a leukemia cohort, our method preserved most of the underlying biological information for each patient while aligning the batches, improving the clustering metrics in the aggregated dataset.

Before running the Differential Expression (DE) analysis, we performed K-Means in the t-SNE subspace of the original (Raw (u)) and each model's corrected data. When K-Means failed to cluster the data due to the clusters' shape, we used Louvain with 50 neighbors for kNN instead, so as to evaluate the DEG detection performance fairly (see Table A). The corresponding metrics are reported in Table B. We also included the results of the supervised DE analysis, i.e., using the true cell type labels, for the original data (Raw (s)). Contrary to what is presented in [1], the batch effects are not strong enough to confound the supervised DEG detection, leading to high F1 scores. Thus, the supervised uncorrected DEG results constitute a good sanity check for how well the batch-effect correction methods preserve the biological signal.
We observe that all batch-effect-corrected data yield perfect clustering in terms of ARI, meaning that the DE analysis results solely reflect the models' ability to preserve the cell types' gene expression and are not impacted by the clustering algorithm's performance. The low performance on the original data is due to the mixing of two cell types, as observed in the corresponding t-SNE visualization in Fig A. We note that ResPAN yields inferior batch mixing results, as shown by its lower batch LISI scores.
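The clustering step described above can be illustrated with a minimal sketch. It assumes scikit-learn, a synthetic two-cell-type matrix standing in for corrected expression data, and illustrative settings (perplexity 30, two clusters) that are not taken from the paper. The helper name `cluster_in_tsne_space` is hypothetical, and the Louvain fallback is not shown.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score


def cluster_in_tsne_space(expression, n_clusters, seed=0):
    """Embed cells with t-SNE, then run K-Means on the 2-D embedding.

    Hypothetical helper; perplexity and initialization are illustrative choices.
    """
    embedding = TSNE(
        n_components=2, perplexity=30, init="pca", random_state=seed
    ).fit_transform(expression)
    labels = KMeans(
        n_clusters=n_clusters, n_init=10, random_state=seed
    ).fit_predict(embedding)
    return embedding, labels


# Synthetic stand-in for a corrected expression matrix (cells x genes):
# two well-separated cell types, 100 cells each, 50 genes.
rng = np.random.default_rng(0)
n_cells_per_type, n_genes = 100, 50
type_a = rng.normal(loc=0.0, scale=1.0, size=(n_cells_per_type, n_genes))
type_b = rng.normal(loc=3.0, scale=1.0, size=(n_cells_per_type, n_genes))
expression = np.vstack([type_a, type_b])
true_labels = np.repeat([0, 1], n_cells_per_type)

embedding, predicted = cluster_in_tsne_space(expression, n_clusters=2)

# ARI compares the predicted clusters with the true cell types;
# a value of 1 corresponds to the "perfect clustering" case.
ari = adjusted_rand_score(true_labels, predicted)
print(f"ARI = {ari:.3f}")
```

When the ARI is close to 1, the clusters passed to the DE analysis coincide with the true cell types, so differences in DE results can be attributed to the correction rather than to the clustering.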

Estimation of the log-fold-change
To further evaluate the models' DE performance, we investigated the faithfulness of the log-fold-change estimation on both versions of the counts, considering either all the genes or the HVGs only. AIF dyn outperforms all the other models on the log-normalized counts. The higher quality of its log-fold-change estimates is visible in the better alignment of the points on the bisector, yielding a lower MSE on both the HVGs and all genes. Seurat's batch-effect correction attenuates the biological difference between the two cell types for some genes, resulting in more undetected DEGs, especially for the HVGs. Although scGen, ResPAN, and AIF dyn yield more false DEGs, these correspond to small estimated log-fold-changes, which would be easily filtered out with a small log-fold-change threshold. ResPAN's results on all genes resemble those of the Raw (s) model, as more than half of the genes are left uncorrected, which makes it difficult to truly assess the quality of the correction. On the HVGs, the difference with Raw (s) is more pronounced as most of the genes are corrected, showing the actual deterioration of the biological signal induced by ResPAN's batch-effect correction compared to the uncorrected data.
On the raw counts, all models yield satisfactory estimates of the log-fold-change on the HVGs. Still, none surpasses the Raw (s) model since the batch effects do not confound the DE analysis on the HVGs' raw counts (perfect alignment of the points on the bisector). Seurat gives the best results among the corrected data. AIF dyn tends to slightly overestimate the log-fold-change, especially for highly differentiated genes, resulting in a marginally higher MSE than ResPAN but still lower than scGen. On all genes, Seurat and AIF dyn slightly outperform the Raw (s) model, and AIF dyn's log-fold-change estimates are the most closely aligned with the bisector. ResPAN and scGen yield poorer estimates of the log-fold-change due to a high number of falsely detected DEGs and, for ResPAN only, a high number of undetected DEGs.
To conclude, AIF dyn provides the corrected data with the best estimation of the log-fold-change on the log-normalized counts while yielding top-ranked results on the raw counts.
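To make the comparison above concrete, the following is a minimal sketch of how a per-gene log-fold-change between two cell types and its MSE against a reference could be computed. The exact estimator used in the paper is not specified in this section; the log2 of mean expression with a pseudocount of 1, the function names, and the toy count matrices are all assumptions made for illustration.

```python
import numpy as np


def log_fold_change(counts, cell_types, type_a, type_b, pseudocount=1.0):
    """Per-gene log2 fold change of mean expression, type_a versus type_b.

    Assumed estimator (log2 of group means plus a pseudocount); the paper's
    DE pipeline may compute the log-fold-change differently.
    """
    mean_a = counts[cell_types == type_a].mean(axis=0)
    mean_b = counts[cell_types == type_b].mean(axis=0)
    return np.log2(mean_a + pseudocount) - np.log2(mean_b + pseudocount)


def lfc_mse(estimated, reference):
    """Mean squared error between estimated and reference log-fold-changes."""
    return float(np.mean((estimated - reference) ** 2))


# Toy data: 4 cells x 2 genes, two cells per cell type.
cell_types = np.array(["A", "A", "B", "B"])
reference_counts = np.array([[3, 1], [3, 1], [1, 1], [1, 1]], dtype=float)
# Hypothetical corrected counts that inflate gene 0 in cell type A.
corrected_counts = np.array([[7, 1], [7, 1], [1, 1], [1, 1]], dtype=float)

true_lfc = log_fold_change(reference_counts, cell_types, "A", "B")
estimated_lfc = log_fold_change(corrected_counts, cell_types, "A", "B")

print(true_lfc)       # gene 0: log2(4) - log2(2) = 1, gene 1: 0
print(estimated_lfc)  # gene 0: log2(8) - log2(2) = 2, gene 1: 0
print(lfc_mse(estimated_lfc, true_lfc))  # (1**2 + 0**2) / 2 = 0.5
```

Points lying on the bisector correspond to `estimated_lfc == true_lfc`. In this toy case the overestimated gene sits above the bisector and accounts for the whole MSE.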

Evaluation on ResPAN's genes subset
In this section, we focus on the DE results on ResPAN's genes subset only, to evaluate the models more fairly. Indeed, since ResPAN does not correct all genes, its intrinsic performance cannot be assessed when the other genes are included, as it then greatly depends on the results of the uncorrected data. As the Raw (s) model yields strong DE results, this unfairly advantages ResPAN, especially on the raw counts. Thus, to assess ResPAN's batch-effect correction alone, we computed the metrics on the genes corrected by ResPAN only and compared the models' log-fold-change estimates on this genes subset (Fig D). First, the difference with the Raw (s) model is more pronounced when focusing on those genes only, since they are no longer diluted by the Raw (s) results on the other genes. This underlines the unfairness of the comparison when all the genes are included for ResPAN and highlights the deterioration of the biological signal induced by ResPAN's batch-effect correction.
On this genes subset, AIF dyn outperforms all the other models on both the log-normalized and raw counts. The difference in performance is larger than on the HVGs or on all genes, which emphasizes its superiority.
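The dilution effect discussed in this section can be illustrated with a small sketch that computes a DEG-detection F1 score either on all genes or on a subset only. The function name `deg_f1`, the boolean-mask representation of the corrected genes, and the six-gene toy example are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np


def deg_f1(true_deg, predicted_deg, gene_mask=None):
    """F1 score of DEG detection, optionally restricted to a subset of genes.

    Hypothetical helper: inputs are boolean per-gene arrays; gene_mask marks
    the genes to keep (e.g. those actually corrected by a given method).
    """
    true_deg = np.asarray(true_deg, dtype=bool)
    predicted_deg = np.asarray(predicted_deg, dtype=bool)
    if gene_mask is not None:
        gene_mask = np.asarray(gene_mask, dtype=bool)
        true_deg, predicted_deg = true_deg[gene_mask], predicted_deg[gene_mask]
    tp = np.sum(true_deg & predicted_deg)   # correctly detected DEGs
    fp = np.sum(~true_deg & predicted_deg)  # falsely detected DEGs
    fn = np.sum(true_deg & ~predicted_deg)  # undetected DEGs
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


# Toy example with 6 genes; genes 2 to 5 are the "corrected" subset.
true_deg = [1, 1, 1, 0, 0, 0]
predicted_deg = [1, 1, 1, 0, 0, 1]  # one false DEG, inside the subset
corrected_mask = [0, 0, 1, 1, 1, 1]

f1_all = deg_f1(true_deg, predicted_deg)
f1_subset = deg_f1(true_deg, predicted_deg, gene_mask=corrected_mask)
print(f1_all)     # 2*3 / (2*3 + 1 + 0) = 6/7
print(f1_subset)  # 2*1 / (2*1 + 1 + 0) = 2/3
```

In this toy case, the two uncorrected genes are detected correctly and raise the score computed on all genes, whereas the score on the subset reflects the corrected genes alone.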

Figure B.
Figure A. Unsupervised clustering on raw data. t-SNE visualization of the uncorrected Dataset 3's raw counts. The cells are colored by batch (left), cell type (middle) or K-Means clusters (right).
Fig C depicts the correlation between the inferred and the actual log-fold-change of differentially expressed genes (DEGs) for each model's corrected data and the corresponding MSE.
Figure C. Comparison of the log-fold-change estimates. Correlation between the inferred and the true log-fold-change of DEGs for each model's corrected raw and log-normalized counts on Dataset 3.

Figure D. Comparison of the log-fold-change estimates on ResPAN's genes subset. Correlation between the inferred and the true log-fold-change of DEGs for each model's corrected raw and log-normalized counts on Dataset 3 restricted to the subset of genes outputted by ResPAN.

Table A. Clustering algorithm for each model and dataset.

Table B. Comparison of the methods' clustering results on the full dataset.